Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance
Abstract
Clustering short texts is a difficult task in itself, and the narrow-domain characteristic poses an additional challenge for current clustering methods. We addressed this problem with a new measure of distance between documents based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability distributions, we have adapted it to obtain a distance value between two documents. We carried out experiments on two different narrow-domain corpora, and our findings indicate that this measure can be used for the addressed problem, obtaining results comparable to those obtained with the Jaccard similarity measure.
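As a rough illustration of the idea in the abstract, the sketch below computes a symmetric Kullback-Leibler distance between two tokenized documents, alongside the Jaccard similarity used as the point of comparison. The abstract does not say how the measure was adapted to documents, so the additive smoothing, the `epsilon` value, and all function names here are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-3):
    """Smoothed term probabilities over a shared vocabulary.

    Additive smoothing is an assumption of this sketch: it keeps every
    probability non-zero so that the logarithm below is always defined.
    """
    counts = Counter(tokens)
    denom = len(tokens) + epsilon * len(vocabulary)
    return {term: (counts[term] + epsilon) / denom for term in vocabulary}


def symmetric_kl(doc_a, doc_b, epsilon=1e-3):
    """Symmetric KL distance: D(P||Q) + D(Q||P) = sum((p - q) * log(p / q))."""
    vocabulary = set(doc_a) | set(doc_b)
    p = term_distribution(doc_a, vocabulary, epsilon)
    q = term_distribution(doc_b, vocabulary, epsilon)
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in vocabulary)


def jaccard_similarity(doc_a, doc_b):
    """Jaccard similarity between the term sets of two documents."""
    set_a, set_b = set(doc_a), set(doc_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, `symmetric_kl("short text clustering".split(), "narrow domain clustering".split())` returns a positive value that grows as the two term distributions diverge, while identical documents yield a distance of zero. A clustering algorithm that takes a pairwise distance matrix could use either this distance or one minus the Jaccard similarity.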
Similar resources
Using Kullback-Leibler distance for performance evaluation of search designs
This paper considers the search problem introduced by Srivastava [Sr]. This is a model discrimination problem. In the context of search linear models, the discrimination ability of search designs has been studied by several researchers. Some criteria have been developed to measure this capability; however, they are restricted in the sense of being able to work for searching only one possibl...
Model Confidence Set Based on Kullback-Leibler Divergence Distance
Consider the problem of estimating the true density h(.) based upon a random sample X1,…, Xn. In general, h(.) is approximated using an appropriate (in some sense, see below) model fθ(x). This article uses Vuong's (1989) test along with a collection of k (> 2) non-nested models to construct a set of appropriate models, called a model confidence set, for the unknown model h(.). Application of such confide...
Comparison of Kullback-Leibler, Hellinger and LINEX with Quadratic Loss Function in Bayesian Dynamic Linear Models: Forecasting of Real Price of Oil
In this paper we examine the application of the Kullback-Leibler, Hellinger and LINEX loss functions in a Dynamic Linear Model (DLM), using 106 years of data on the real price of oil, from 1913 to 2018, with regard to the asymmetric problem in filtering and forecasting. We use the DLM form of the basic Hotelling model under the Quadratic, Kullback-Leibler, Hellinger and LINEX loss functions, trying to address the ...
Evaluating the Improvement of Partial Discharge Localization Accuracy Using Frequency Response Assurance Criterion
Partial Discharge (PD) is the most important source of insulation degradation in power transformers. In order to prevent catastrophic failures in transformers, PDs need to be located as soon as possible so that maintenance measures can be taken in time. Due to the structural complexity of windings, locating the PD source inside a transformer winding is not a simple task. In this paper, the effi...
Geometric Clustering of Multimedia Databases
Many kinds of texts are now available in various types of databases, and new methods are needed to fully utilize them in a wide range of applications. In the field of information retrieval from full-text databases, the vector-space model has been developed over 20 years (Salton et al. [14, 13]), and further the so-called latent semantic indexing based on the singular value dec...